{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Redis数据库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 分布式爬虫"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "① 分布式爬虫（多台机器协作）的关键是：共享爬取队列，爬取队列在主机维护，各从机调度器统一从共享队列获取Request，即主机维护爬取队列，从机负责数据抓取、数据处理、数据存储。\n",
    "\n",
    "② Redis是非关系型数据库（Key-Value形式存储），结构灵活；是内存中数据结构存储系统，处理速度快，性能好；提供队列、集合等多重存储结构，方便队列维护。\n",
    "\n",
    "③ scrapy_redis在scrapy基础上实现了更多、更强大的功能，具体体现在：request去重，爬虫持久化，轻松实现分布式。\n",
    "\n",
    "④ 如何实现去重：借助Redis集合，在集合中存储每个Request的指纹。在向Request队列中加入Request之前先验证这个Request的指纹是否已经在集合中：\n",
    " \n",
    " - 已存在则不添加到队列 \n",
    " - 不存在则添加入队列并将指纹加入集合\n",
    " \n",
    "⑤ 如何防止中断（实现持久化）：在每台从机Scrapy启动时都会首先判断Redis Request队列是否为空：\n",
    "\n",
    " - 不为空则从队列中取得下一个Request执行爬取\n",
    " - 为空则重新开始爬取，第一台从机执行爬取队列中天下Request"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "class A:\n",
    "    pass\n",
    "aa =  A()\n",
    "print(aa)\n",
    "print(pickle.dumps(aa))\n",
    "\n",
    "\n",
    "a = [1,2,3]\n",
    "b = pickle.dumps(a)\n",
    "c = json.dumps(a)\n",
    "\n",
    "print(b)\n",
    "print(c)\n",
    "\n",
    "print(pickle.loads(b))\n",
    "print(json.load(c))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
